Red Wine Quality EDA by Sergii Bondariev

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Plots Section

Distribution of quality

The plot above confirms that, as described in the accompanying text file, the data is not balanced. There are many more average wines than poor or excellent ones.

Distribution of ‘fixed.acidity’

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Most wines have a fixed acidity in the interval (6, 11).

Distribution of ‘volatile.acidity’

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most wines have a volatile acidity in the interval (0.2, 0.9).

Distribution of citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

A large number of wines contain no citric acid. Most wines have a citric acid concentration below 0.5 g/dm^3. There is at least one outlier with a citric acid concentration of 1 g/dm^3.

Distribution of a residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual sugar resembles a narrow normal distribution with a long right tail. A number of outliers make the tail even longer.

The distribution of chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution is similar to that of residual sugar.

Distribution of free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Distribution of total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Distribution of (total.sulfur.dioxide - free.sulfur.dioxide)

Distribution of density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Resembles a normal distribution

Distribution of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Resembles a normal distribution. All wines are acidic, with pH between 2.74 and 4.01.

Distribution of sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Right-skewed distribution with a long right tail and outliers.

Distribution of alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Univariate Analysis

What is the structure of your dataset?

The red wine dataset consists of 1599 observations and 13 variables. The quality was converted to a factor variable.
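The conversion of quality to a factor can be sketched as follows. This is a minimal illustration, not the author's code: it uses only the first ten quality values shown in the structure output above, and the choice of an ordered factor with levels 3 to 8 is an assumption (the report only says that a factor was used).

```r
# First ten 'quality' values, copied from the str() output above.
quality_raw <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumption: an ordered factor covering the scores 3..8 mentioned in the report.
quality <- factor(quality_raw, levels = 3:8, ordered = TRUE)

# Counts per score; scores absent from these ten rows show up as 0.
quality_counts <- table(quality)
print(quality_counts)
```

Declaring all levels up front keeps empty scores visible in tables and plots, which makes the imbalance easier to see.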

What is/are the main feature(s) of interest in your dataset?

‘quality’ is the main feature, because it would be the dependent variable when training and evaluating a model to predict wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point there is no evidence to discard any of the features except the identifier ‘X’. It seems likely that free.sulfur.dioxide and total.sulfur.dioxide are highly correlated, because the total SO2 content includes free SO2. If so, total.sulfur.dioxide could later be dropped and replaced by (total.sulfur.dioxide - free.sulfur.dioxide). This is examined in the next section.

Did you create any new variables from existing variables in the dataset?

One variable was created along the way, (total.sulfur.dioxide - free.sulfur.dioxide), to see the distribution of the bound sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

One unusual distribution was noticed, for the ‘citric.acid’ feature. So far the form of the data has not been changed.

Bivariate Plots Section

The variables free.sulfur.dioxide and total.sulfur.dioxide describe the same chemical component, so a dependency between them is likely. Let’s make a scatterplot.

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665
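The t statistic and the confidence interval in this output follow from the sample correlation and the degrees of freedom alone. The sketch below recomputes them from the reported values, assuming the standard formulas for a Pearson correlation test (t = r * sqrt(df) / sqrt(1 - r^2) and a Fisher z interval); the variable names are chosen here for illustration.

```r
# Values copied from the test output above.
r  <- 0.6676665   # sample correlation
df <- 1597        # degrees of freedom, n - 2
n  <- df + 2      # number of observations

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 2))
print(round(ci, 4))
```

The same check can be applied to any of the correlation tests below by swapping in the reported correlation.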

We see a moderately strong correlation here. The text file implies that free.sulfur.dioxide is included in total.sulfur.dioxide. Let’s create a new variable, bound.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide, with the goal of making the two variables less dependent. Then we make the scatterplot again and compare it to the previous one.

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and bound.sulfur.dioxide
## t = 18.771, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3841336 0.4644895
## sample estimates:
##       cor 
## 0.4251489

This relationship has a similar shape but a smaller correlation. For modeling, it is better to use free.sulfur.dioxide and bound.sulfur.dioxide in place of total.sulfur.dioxide.
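A small sketch of the derived variable, using only the first ten rows shown in the structure output at the top of the report (so the correlations it prints will differ from the full-data values above). The data frame name wine10 is hypothetical. The last lines check the identity cov(free, total) = var(free) + cov(free, bound), which is why the total is tied to the free part by construction.

```r
# First ten rows of the two SO2 columns, copied from the str() output.
wine10 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bound SO2 is whatever part of the total is not free.
wine10$bound.sulfur.dioxide <-
  wine10$total.sulfur.dioxide - wine10$free.sulfur.dioxide

# Correlation of free SO2 with the total and with the bound part
# (ten rows only, so these are not the full-data estimates).
cor_total <- cor(wine10$free.sulfur.dioxide, wine10$total.sulfur.dioxide)
cor_bound <- cor(wine10$free.sulfur.dioxide, wine10$bound.sulfur.dioxide)
print(c(total = cor_total, bound = cor_bound))

# Because total = free + bound, the covariance with the total
# always contains the variance of the free part itself.
lhs <- cov(wine10$free.sulfur.dioxide, wine10$total.sulfur.dioxide)
rhs <- var(wine10$free.sulfur.dioxide) +
  cov(wine10$free.sulfur.dioxide, wine10$bound.sulfur.dioxide)
```

On the full dataset, the same two calls with cor.test() in place of cor() would give output of the kind shown above.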

Let’s plot a scatterplot matrix on a sample of 500 points.

The matrix reveals some interesting relationships, for example those with a correlation above 0.5 in absolute value. Such cases are examined next.
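One way to pick out such pairs is to sample the rows, compute the correlation matrix and keep the entries above the threshold. The sketch below is an assumption about how this could be done, not the author's code. It runs on the first ten rows from the structure output (data frame wine10, hypothetical name), so the sample size is capped at the number of available rows.

```r
# First ten rows of four columns, copied from the str() output.
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Sample up to 500 rows without replacement (here: all ten).
set.seed(1)
n_sample <- min(500, nrow(wine10))
sampled  <- wine10[sample(nrow(wine10), n_sample), ]

# Pairwise Pearson correlations.
cors <- cor(sampled)

# Keep each pair once (upper triangle) when |r| exceeds 0.5.
idx <- which(abs(cors) > 0.5 & upper.tri(cors), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(strong)
```

Calling pairs(sampled) on the same sample draws a basic scatterplot matrix.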

First, let’s check how pH depends on the different kinds of acid, because, as we know from chemistry, acidity changes pH. The question is whether we can exclude pH from future analysis by treating it as a dependent variable.

pH vs. Citric Acid

## 
##  Pearson's product-moment correlation
## 
## data:  pH and citric.acid
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041

We can clearly see the expected negative correlation: more acid means lower pH.

pH vs. Fixed Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  pH and fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

Again we can clearly see the expected negative correlation: more acid means lower pH.

pH vs. Volatile Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  pH and volatile.acidity
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1880823 0.2807254
## sample estimates:
##       cor 
## 0.2349373

With volatile acidity, however, the picture is different: the correlation is weak and has the opposite sign.

In conclusion, pH decreases as citric acid or fixed acidity increases, as expected.

It is also likely that citric acid is part of the fixed acidity. Let’s check this.

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

We see a fairly high correlation here. However, the amount of citric acid in a wine is much smaller than the fixed acidity, so another effect may be at work. Let’s check the correlation between citric.acid and the other acids in fixed.acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and fixed.acidity - citric.acid
## t = 30.199, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5707406 0.6332009
## sample estimates:
##       cor 
## 0.6028937

The correlation is only a little lower. This suggests that wines with higher levels of citric acid also have higher levels of other fixed acids, which is an interesting finding.

We can also check whether volatile acidity is correlated with fixed acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and fixed.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309

As we can see, the correlation is weak and negative.

Out of curiosity, does density depend on fixed acidity?

Density vs. Fixed Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

We see a positive correlation here. This is an interesting finding, because it is not obvious that density would depend on acidity. A common underlying factor could be the reason.

How about density and chlorides? Chlorides are salts, which are soluble in water, so we can expect higher density with more salt. Let’s plot them.

Density vs. Chlorides

There are some outliers, let’s exclude them and plot again.

The dependency is positive, as expected. What about sugars? The story should be similar.

Density vs. Residual Sugar

Again, let’s exclude the outliers.
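The report does not say how outliers were excluded. One common approach, shown here purely as an assumption, is to drop values above a high quantile before plotting. The sketch uses the first ten residual.sugar values from the structure output and a hypothetical 95% cut-off.

```r
# First ten 'residual.sugar' values, copied from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical rule: treat everything above the 95th percentile as an outlier.
cutoff <- quantile(residual.sugar, probs = 0.95)

# Keep only the values at or below the cut-off.
sugar_trimmed <- residual.sugar[residual.sugar <= cutoff]

print(unname(cutoff))
print(sugar_trimmed)
```

With real data the same logical condition would be applied to the rows of the data frame, so that density and residual sugar stay aligned.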

Similarly, a positive dependency is observed. Chlorides and residual sugar both add to density, which is logical because they are soluble in water.

Now an interesting question is how density depends on alcohol.

Density vs. Alcohol

Wines with a higher alcohol level tend to have lower density.

Now we will make boxplots of each feature per quality rating. The goal is to find out which features quality depends on most.

These plots suggest that quality has a roughly linear dependency on volatile acidity, citric acid, sulphates and alcohol. There are other dependencies which do not appear to be linear.
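The center line of each box is a per-group median, which can also be computed directly. The sketch below does this for alcohol by quality on the first ten rows from the structure output, so it only illustrates the mechanics and says nothing about the full-data trend.

```r
# First ten rows of 'alcohol' and 'quality', copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol within each quality score.
median_by_quality <- tapply(alcohol, quality, median)
print(median_by_quality)
```

Calling boxplot(alcohol ~ quality) on the same vectors draws the corresponding boxes.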

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • The quality of wine decreases as volatile acidity increases.
  • Wines with higher levels of citric acid tend to be of higher quality.
  • Quality looks uncorrelated with residual sugar and chlorides.
  • Lower-density wines tend to have better quality.
  • Wines with lower pH tend to have better quality.
  • The more sulphates a wine has, the higher its quality tends to be.
  • Wines with a higher percentage of alcohol tend to be of higher quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • As expected, pH decreases as citric acid or fixed acidity increases. In addition, the analysis suggests that citric acid can be thought of as a component of fixed acidity. We can try to exclude pH from the predictors of wine quality.
  • Chlorides and residual sugar both add to density, which is natural because they are soluble in water.
  • Wines with a higher alcohol level tend to have lower density. Combined with two observations from the section above, “lower-density wines have better quality” and “wines with a higher percentage of alcohol tend to be of higher quality”, this suggests that density may be redundant with alcohol and could be excluded from the main predictors of wine quality.

What was the strongest relationship you found?

  • Free SO2 and total SO2 are highly correlated (r ≈ 0.67), because total SO2 includes free SO2.
  • Citric acid and fixed acidity are highly correlated too (r ≈ 0.67), likely for the same reason.

Multivariate Plots Section

To make multivariate plots, we will transform several continuous variables of interest into discrete ones by cutting them into bins. The bin widths are chosen so that the bins are roughly balanced. The variables chosen for binning are alcohol, citric acid and sulphates, because the bivariate analysis showed a clear dependency of quality on these features.

We will also exclude the quality scores 3 and 8 from the analysis, because they have far fewer observations than scores 4, 5, 6 and 7.
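The two preparation steps can be sketched as follows. The break points are an assumption: they reuse the quartiles from the alcohol summary earlier in the report, which is one way to get roughly balanced bins on the full data, while the author's actual breaks are not given. The data are the first ten rows from the structure output, and the names alcohol.bucket and keep are hypothetical.

```r
# First ten rows of 'alcohol' and 'quality', copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed breaks: min, quartiles and max from the alcohol summary above.
breaks <- c(8.40, 9.50, 10.20, 11.10, 14.90)

# Cut alcohol into bins; include.lowest keeps the minimum value in the first bin.
alcohol.bucket <- cut(alcohol, breaks = breaks, include.lowest = TRUE)
bucket_counts  <- table(alcohol.bucket)
print(bucket_counts)

# Keep only the well-populated quality scores 4..7.
keep <- quality %in% 4:7
alcohol.bucket.mid <- alcohol.bucket[keep]
quality.mid        <- factor(quality[keep], levels = 4:7)
print(table(quality.mid, alcohol.bucket.mid))
```

The resulting factor can then serve as the third dimension in a boxplot, for example as the fill or facet variable.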

We will start with the quality variable, as it is the main variable of interest. Because quality is a factor variable, we will use boxplots and pick several combinations of variables.

How does quality depend on volatile acidity and citric acid? Let’s make two types of boxplots.

We find that, for a given citric acid level, wine quality increases as volatile acidity decreases. And for the same quality level, wines with a higher citric acid content appear to have lower volatile acidity.

How does quality depend on volatile acidity and alcohol? Let’s make two types of boxplots.

For a given alcohol level, wine quality increases as volatile acidity decreases. And for the same quality level, wines with a higher alcohol content appear to have lower volatile acidity.

How does quality depend on volatile acidity and sulphates? Let’s make two types of boxplots.

For a given sulphates level, wine quality increases as volatile acidity decreases. And for the same quality level, wines with a higher sulphates content appear to have lower volatile acidity.

Now let’s try scatterplots faceted by quality, which can give us additional insights into the data.

First we check Volatile Acidity vs. Alcohol by Quality.

We see a similar pattern here: for a given alcohol level, quality is lower when volatile acidity is higher. And for a given volatile acidity level, the quality score is higher for a higher alcohol percentage.

Next we check Volatile Acidity vs. Sulphates by Quality.

A similar picture is observed: lower quality for higher volatile acidity at a given sulphates level, and higher quality with higher sulphates at a given volatile acidity. Outliers exist as well.

It is also worth looking at the dependencies among the other variables, apart from quality. How are alcohol, volatile acidity and sulphates related?

Interestingly, for a given alcohol level, wines with higher levels of sulphates tend to have lower volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of the analysis was dedicated to studying the dependency of wine quality on other features in the dataset. The analysis shows the following:

  • Given a citric acid level, the wine quality increases as volatile acidity decreases
  • For the same quality level, wines with higher contents of the citric acid appear to have a smaller volatile acidity
  • Given alcohol level, the wine quality increases as volatile acidity decreases
  • For the same quality level, wines with higher contents of the alcohol appear to have a smaller volatile acidity
  • Given sulphates level, the wine quality increases as volatile acidity decreases
  • For the same quality level, wines with higher contents of the sulphates appear to have a smaller volatile acidity

Were there any interesting or surprising interactions between features?

  • For a given alcohol level, wines with higher levels of sulphates tend to have lower volatile acidity

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No models were created.


Final Plots and Summary

Plot One

Description One

In the plot above we observe a positive dependency between density and fixed acidity. A similar relationship is found for ‘density vs. sugars’ and ‘density vs. chlorides’. While almost everybody knows that sugar and salt are soluble in water and therefore add to density, not everyone knows that wines with higher fixed acidity tend to have a higher density.

Plot Two

Description Two

As we can see in the plot above, wines with a higher citric acid level tend to be rated higher. This agrees with the view that citric acid can add ‘freshness’ and flavor to wines.

Plot Three

Description Three

Volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. Sulphates are a wine additive which can contribute to sulfur dioxide gas (SO2) levels; SO2 acts as an antimicrobial and antioxidant.

The plot above shows that, for a given sulphates level, wine quality increases as volatile acidity decreases. This agrees with the description of volatile acidity. However, we can also notice that for the same quality level, wines with a higher sulphates content appear to have lower volatile acidity. What lies behind this interesting dependency is left open for discussion.


Reflection

The goal of the exploratory data analysis of the red wine dataset was to determine the relationships between the existing variables, especially between ‘Quality’ and the other variables, to communicate the findings to a reader, and to prepare for the machine learning step.

In the first section, a univariate analysis was done to see and understand the distribution of every variable in the dataset. In particular, it was found that the data is not balanced and most of the quality scores are in the middle. This fact helped to read the plots in subsequent sections with deeper understanding.

The second section described the process and the results of the bivariate analysis. Here it was shown that some of the variables, besides ‘Quality’, are not independent, although independence is often an assumption in machine learning models. Additionally, some dependencies of ‘Quality’ on other variables were found, some of which seemed to be linear, while others could be approximated by a higher-order polynomial. The bivariate analysis helped to prepare for the multivariate analysis by selecting the variables to analyse further.

In the multivariate analysis, the goal was to explain the variation of variables as a function of Quality using one or more additional variables. The boxplot was chosen as a base, and the ‘third’ dimension was introduced via new factor variables, created by cutting the range of existing variables into adjacent groups. The multivariate analysis helped to find dependencies, some of which can be explained by the writer, while others are not so easily explicable.

The difficulties in the analysis were on the interpretation side, because of a lack of chemistry expertise. There is a strong suspicion that there are dependencies between variables which this analysis has not uncovered. At the same time, some of the unexplained relationships found here might be easily explained by a chemist.

Still, some success was achieved in revealing existing relationships between features. This would help in a subsequent machine learning step, if one is undertaken in order to predict wine quality. More importantly for the author, it has strengthened the skills needed to explore various kinds of data using a variety of application packages suitable for the task.

The analysis of the wine dataset shouldn’t stop with the analysis presented in this article. One area to improve is to get more observations and more variables, such as vintage year, weather, price, etc. The dataset analysed is very small and there are many other possible variables out there. It is also necessary to understand better the setting in which the wines were evaluated by experts. This can help to see whether any biases could have been introduced into the data. To get more understanding of the data, machine learning algorithms could be used. If the goal is not only to make predictions but to understand the data as well, not all methods are appropriate. For example, single decision trees can be used to show dependencies in the data, while random forests may not be appropriate, because their results are hard to explain.

Thank you, Sergii Bondariev